57 research outputs found

    A Pan-cancer Somatic Mutation Embedding using Autoencoders

    Background: Next-generation sequencing instruments are providing new opportunities for comprehensive analyses of cancer genomes. The increasing availability of tumor data makes it possible to investigate the complexity of cancer with machine learning methods. The large available repositories of high-dimensional tumor samples characterised with germline and somatic mutation data require advanced computational modelling for data interpretation. In this work, we propose to analyze these complex data with neural network learning, a methodology that has made impressive advances in image and natural language processing. Results: Here we present a tumor mutation profile analysis pipeline based on an autoencoder model, which is used to discover better representations of lower dimensionality from large somatic mutation data of 40 different tumor types and subtypes. Kernel learning with hierarchical cluster analysis is used to assess the quality of the learned somatic mutation embedding, on which support vector machine models are used to accurately classify tumor subtypes. Conclusions: The learned latent space maps the original samples into a much lower dimension while keeping the biological signals from the original tumor samples. This pipeline and the resulting embedding allow an easier exploration of the heterogeneity within and across tumor types and an accurate classification of tumor samples in the pan-cancer somatic mutation landscape.
    Affiliation: Palazzo, Martin. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigación en Biomedicina de Buenos Aires - Instituto Partner de la Sociedad Max Planck; Argentina. Universidad Tecnológica Nacional; Argentina
    Affiliation: Beauseroy, Pierre. Université de Technologie de Troyes; France
    Affiliation: Yankilevich, Patricio. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigación en Biomedicina de Buenos Aires - Instituto Partner de la Sociedad Max Planck; Argentina

    Learning Kernels from genetic profiles to discriminate tumor subtypes

    Our work aims to perform the feature selection step in Multiple Kernel Learning by optimizing the Kernel Target Alignment (KTA) score. It begins by building feature-wise Gaussian kernel functions. Then, through a constrained linear combination of the feature-wise kernels, we aim to increase the KTA to obtain a new, optimized custom kernel. The linear combination yields a sparse solution in which only a few kernels survive to improve the KTA, and consequently a reduced feature subset is obtained. Considerably reducing the original gene set allows the selected genes to be studied in greater depth for clinical purposes. The higher the KTA obtained, the better the feature selection, since we want to build custom kernels to use for classification later. The final kernel after optimizing the KTA is built as a linear combination of kernels K_i, each associated with a coefficient μ_i. The μ vector is computed during the optimization process.
    Sociedad Argentina de Informática e Investigación Operativa
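
    The quantities named in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy data, the kernel width `sigma`, the function names and the hand-picked weight vector `mu` are all assumptions made for illustration, and the constrained optimization that produces a sparse μ is not reproduced because the abstract does not specify it. The sketch only builds one Gaussian kernel per feature, scores each with the standard KTA definition (the normalized Frobenius inner product between a kernel and the ideal kernel y yᵀ), and forms a weighted combination.

```python
import numpy as np

def gaussian_kernel_1d(x, sigma=1.0):
    """Gaussian kernel matrix computed from a single feature (one column)."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def kta(K, y):
    """Kernel Target Alignment between K and the ideal kernel y y^T.

    y is expected to hold labels in {-1, +1}.
    """
    Ky = np.outer(y, y)
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.sum(K * Ky) / (np.linalg.norm(K) * np.linalg.norm(Ky))

def combined_kernel(kernels, mu):
    """Linear combination sum_i mu_i * K_i of stacked kernels (d, n, n)."""
    return np.tensordot(mu, kernels, axes=1)

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = np.column_stack([
    [0.0, 0.1, 0.2, 3.0, 3.1, 3.2],
    [0.0, 3.0, 1.0, 2.5, 0.5, 2.0],
])
y = np.array([-1, -1, -1, 1, 1, 1])

# One Gaussian kernel per feature, stacked into shape (n_features, n, n).
kernels = np.stack([gaussian_kernel_1d(X[:, j]) for j in range(X.shape[1])])
scores = np.array([kta(K, y) for K in kernels])

# Hand-picked sparse weights (an assumption): keep only the first kernel.
mu = np.array([1.0, 0.0])
K_final = combined_kernel(kernels, mu)

print("per-feature KTA:", scores)
print("KTA of combined kernel:", kta(K_final, y))
```

    In this toy setting the informative feature obtains the higher alignment, which is the behaviour a KTA-driven selection relies on: features whose kernels receive zero weight drop out of the final kernel.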

    IDconverter and IDClight: Conversion and annotation of gene and protein IDs

    Background: Researchers involved in the annotation of large numbers of gene, clone or protein identifiers are usually required to perform a one-by-one conversion for each identifier. In fields such as microarray experiments, the number of identifiers may be around 30,000. Results: To help researchers map accession numbers and identifiers among clones, genes, proteins and chromosomal positions, we have designed and developed IDconverter and IDClight. They are two user-friendly, freely available web server applications that also provide additional functional information by mapping the identifiers onto pathways, Gene Ontology terms, and literature references. Both tools are high-throughput oriented and include identifiers for the most common genomic databases. These tools have been compared to other similar tools, showing that they are among the fastest and the most up-to-date. Conclusion: These tools provide a fast and intuitive way of enriching the information coming out of high-throughput experiments like microarrays. They can be valuable both to wet-lab researchers and to bioinformaticians.
    Funding has been provided by Fundación de Investigación Médica Mutua Madrileña and Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science (MEC). RD-U is partially supported by the Ramón y Cajal programme of the Spanish MEC.

    Hepatocellular Carcinoma tumor stage classification and gene selection using machine learning models

    Cancer researchers now have the opportunity to analyze and learn from large quantities of omic profiles of tumor samples. Different omic data are now available in several databases, and bioinformatics data analysis and interpretation are the current bottlenecks. In this study, somatic mutation and gene expression data from hepatocellular carcinoma tumor samples are used to discriminate, by kernel learning, between tumor subtypes and between early and late stages. This classification will allow medical doctors to establish an appropriate treatment according to the tumor stage. By building kernel machines we could discriminate both classes with an acceptable classification accuracy. Feature selection has been implemented to select the key genes whose differential expression improves the separability between the samples of early and late stages.
    Special Issue dedicated to JAIIO 2018 (Jornadas Argentinas de Informática).
    Sociedad Argentina de Informática e Investigación Operativa

    Asterias: a parallelized web-based suite for the analysis of expression and aCGH data

    Asterias (http://www.asterias.info) is an integrated collection of freely accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI). Most of our applications allow the user to obtain additional information for user-selected genes by using clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data; converting between different types of gene/clone and protein identifiers; filtering and imputation; finding differentially expressed genes related to patient class and survival data; searching for models of class prediction; using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity; searching for molecular signatures and predictive genes with survival data; detecting regions of genomic DNA gain or loss. The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications.
    Comment: web-based application; 3 figures

    An improved catalogue of putative synaptic genes defined exclusively by temporal transcription profiles through an ensemble machine learning approach

    Background: Assembly and function of neuronal synapses require the coordinated expression of an as yet undetermined set of genes. Previously, we trained an ensemble machine learning model to assign a probability of having synaptic function to every protein-coding gene in Drosophila melanogaster. This approach resulted in the publication of a catalogue of 893 genes which we postulated to be highly enriched in genes with a still undocumented synaptic function. Since then, the scientific community has experimentally identified 79 new synaptic genes. Here we use these new empirical data to evaluate our original prediction. We also implement a series of changes to the training scheme of our model and, using the new data, demonstrate that this improves its predictive power. Finally, we added the new synaptic genes to the training set and trained a new model, obtaining a new, enhanced catalogue of putative synaptic genes. Results: The retrospective analysis demonstrates that our original catalogue was significantly enriched in new synaptic genes. When the changes to the training scheme were implemented using the original training set, we obtained even higher enrichment. Finally, applying the new training scheme with a training set including the 79 new synaptic genes resulted in an enhanced catalogue of putative synaptic genes. Here we present this new catalogue and announce that a regularly updated version will be available online at: http://synapticgenes.bnd.edu.uy Conclusions: We show that training an ensemble of machine learning classifiers solely with the whole-body temporal transcription profiles of known synaptic genes resulted in a catalogue with a significant enrichment in undiscovered synaptic genes. Using new empirical data provided by the scientific community, we validated our original approach, improved our model and obtained an arguably more precise prediction. This approach reduces the number of genes to be tested through hypothesis-driven experimentation and will facilitate our understanding of neuronal function. Availability: http://synapticgenes.bnd.edu.uy
    Affiliation: Pazos Obregón, Flavio. Instituto de Investigaciones Biológicas "Clemente Estable"; Uruguay
    Affiliation: Palazzo, Martin. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigación en Biomedicina de Buenos Aires - Instituto Partner de la Sociedad Max Planck; Argentina
    Affiliation: Soto, Pablo. Instituto de Investigaciones Biológicas "Clemente Estable"; Uruguay
    Affiliation: Guerberoff, Gustavo. Universidad de la República; Uruguay
    Affiliation: Yankilevich, Patricio. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigación en Biomedicina de Buenos Aires - Instituto Partner de la Sociedad Max Planck; Argentina
    Affiliation: Cantera, Rafael. Instituto de Investigaciones Biológicas "Clemente Estable"; Uruguay

    GenIO: A phenotype-genotype analysis web server for clinical genomics of rare diseases

    Background: GenIO is a novel web server designed to assist clinical genomics researchers and medical doctors in the diagnostic process of rare genetic diseases. The tool identifies the most probable variants causing a rare disease, using the genomic and clinical information provided by a medical practitioner. Variants identified in whole-genome, whole-exome or targeted sequencing studies are annotated, classified and filtered by clinical significance. Candidate genes associated with the patient's symptoms, suspected disease and complementary findings are identified to obtain a small, manageable number of the most probable recessive and dominant candidate gene variants associated with the rare disease case. Additionally, following the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP) guidelines and recommendations, all potentially pathogenic variants that might be contributing to disease and secondary findings are identified. Results: A retrospective study was performed on 40 patients with a diagnostic rate of 40%. All the known genes that were previously considered disease-causing were correctly identified in the final inheritance-model output lists. In previously undiagnosed cases, we had no additional yield. Conclusion: This unique, intuitive and user-friendly tool to assist medical doctors in the clinical genomics diagnostic process is openly available at https://bioinformatics.ibioba-mpsp-conicet.gov.ar/GenIO/.
    Affiliation: Koile, Daniel Isaac. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigación en Biomedicina de Buenos Aires - Instituto Partner de la Sociedad Max Planck; Argentina
    Affiliation: Córdoba, Marta. Gobierno de la Ciudad de Buenos Aires. Hospital General de Agudos "Ramos Mejía"; Argentina. Universidad Austral. Facultad de Ciencias Biomédicas. Instituto de Investigaciones en Medicina Traslacional. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones en Medicina Traslacional; Argentina
    Affiliation: de Sousa Serro, Maximiliano Guillermo. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigación en Biomedicina de Buenos Aires - Instituto Partner de la Sociedad Max Planck; Argentina
    Affiliation: Kauffman, Marcelo Andres. Gobierno de la Ciudad de Buenos Aires. Hospital General de Agudos "Ramos Mejía"; Argentina. Universidad Austral. Facultad de Ciencias Biomédicas. Instituto de Investigaciones en Medicina Traslacional. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones en Medicina Traslacional; Argentina
    Affiliation: Yankilevich, Patricio. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigación en Biomedicina de Buenos Aires - Instituto Partner de la Sociedad Max Planck; Argentina

    Asterias: A Parallelized Web-based Suite for the Analysis of Expression and aCGH Data

    The analysis of expression and CGH arrays plays a central role in the study of complex diseases, especially cancer, including finding markers for early diagnosis and prognosis, choosing an optimal therapy, or increasing our understanding of cancer development and metastasis. Asterias (http://www.asterias.info) is an integrated collection of freely accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI) and run on a server with 60 CPUs for computation; compared to a desktop application or a server-based but non-parallelized one, parallelization provides speed-ups of up to a factor of 50. Most of our applications allow the user to obtain additional information for user-selected genes (chromosomal location, PubMed ids, Gene Ontology terms, etc.) by using clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data (DNMAD); converting between different types of gene/clone and protein identifiers (IDconverter/IDClight); filtering and imputation (preP); finding differentially expressed genes related to patient class and survival data (Pomelo II); searching for models of class prediction (Tnasas); using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity (GeneSrF); searching for molecular signatures and predictive genes with survival data (SignS); detecting regions of genomic DNA gain or loss (ADaCGH). The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications.